Day 22 [Python ML, Data Visualization] Scatter Plots

2021 iThome Ironman Challenge

DAY 22

AI & Data

Learning Machine Learning with Python series, part 22

13th Ironman Challenge

guancioul

Team: 人工逗點智慧

2021-10-05 08:27:32

Setting up the Jupyter notebook

import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

Setup Complete

Loading and displaying the data

# Path of the file to read
insurance_filepath = "./insurance.csv"

# Read the file into a variable insurance_data
insurance_data = pd.read_csv(insurance_filepath)

After loading the data, you can print its first five rows:

insurance_data.head()
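
If insurance.csv is not at hand, the same loading and inspection steps can be tried on a small stand-in. The sketch below is only an illustration: the rows are made up, the name sample_data is hypothetical, and only the three columns this post uses (bmi, smoker, charges) are included.

```python
import io

import pandas as pd

# Hypothetical stand-in for insurance.csv: made-up rows, three columns only.
csv_text = """bmi,smoker,charges
22.5,no,3200.0
27.9,yes,16800.0
31.2,no,4700.0
35.6,yes,39500.0
29.1,no,5100.0
24.3,no,2900.0
"""

# pd.read_csv accepts a file-like object as well as a file path.
sample_data = pd.read_csv(io.StringIO(csv_text))

print(sample_data.shape)    # (number of rows, number of columns)
print(sample_data.dtypes)   # bmi and charges should be numeric for plotting
print(sample_data.head(3))  # head(n) returns the first n rows (default is 5)
```

Checking the dtypes early is worthwhile because the scatter plots that follow need numeric x and y columns.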

Scatter plots

To create a simple scatter plot, first specify the data for the x-axis and the y-axis:

sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'])

<AxesSubplot:xlabel='bmi', ylabel='charges'>

Judging from the plot above, people with a higher BMI tend, in principle, to be charged more.

Adding a regression line to the plot helps check whether this guess holds:

sns.regplot(x=insurance_data['bmi'], y=insurance_data['charges'])

<AxesSubplot:xlabel='bmi', ylabel='charges'>
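
The regression line gives a visual impression; the same trend can also be put into numbers. The sketch below is an illustration on made-up values (the sample_data frame is hypothetical, not taken from insurance.csv). It computes the Pearson correlation and fits a degree-1 least-squares line with np.polyfit, the same kind of straight line that regplot draws by default.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not taken from insurance.csv.
sample_data = pd.DataFrame({
    "bmi": [20.0, 24.0, 28.0, 32.0, 36.0, 40.0],
    "charges": [3000.0, 5200.0, 4800.0, 9000.0, 8500.0, 12000.0],
})

# Pearson correlation: values near +1 indicate a strong positive linear trend.
r = sample_data["bmi"].corr(sample_data["charges"])

# Least-squares line: charges is approximated by slope * bmi + intercept.
slope, intercept = np.polyfit(sample_data["bmi"], sample_data["charges"], deg=1)

print(f"correlation: {r:.2f}")
print(f"slope: {slope:.1f} (change in charges per BMI unit)")
```

A positive slope together with a correlation well above zero supports the reading that charges rise with BMI; on the real data the numbers will of course differ.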

Color-coded scatter plots

To see how smoking status (smoker) relates to BMI and charges, color the points by the smoker column:

sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'], hue=insurance_data['smoker'])

<AxesSubplot:xlabel='bmi', ylabel='charges'>

We can then use sns.lmplot to compare the regression lines of the two groups:

sns.lmplot(x="bmi", y="charges", hue="smoker", data=insurance_data)

<seaborn.axisgrid.FacetGrid at 0x7f894623c080>

The regression line for smokers is much steeper than the one for non-smokers.
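
To compare the two lines numerically rather than by eye, one option is to fit a separate straight line per group. The following is a minimal sketch on made-up data (sample_data and the slopes dictionary are hypothetical names); it uses groupby with np.polyfit and is not necessarily the exact computation lmplot performs internally.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not taken from insurance.csv.
sample_data = pd.DataFrame({
    "bmi":     [22.0, 26.0, 30.0, 34.0, 22.0, 26.0, 30.0, 34.0],
    "charges": [3000.0, 3500.0, 4100.0, 4400.0,
                15000.0, 21000.0, 30000.0, 38000.0],
    "smoker":  ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
})

# Fit one least-squares line per smoker group and keep its slope.
slopes = {}
for smoker_value, group in sample_data.groupby("smoker"):
    slope, intercept = np.polyfit(group["bmi"], group["charges"], deg=1)
    slopes[smoker_value] = slope

print(slopes)
```

A larger slope for the "yes" group corresponds to the steeper line seen in the lmplot figure.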

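sns.lmplot is called slightly differently from the plotting functions used so far:

Earlier, the x-axis was passed as a column of the DataFrame, x=insurance_data['bmi']; with lmplot, only the column name is needed, x="bmi".
The same applies to y and hue.
data=insurance_data tells the function which DataFrame to look those column names up in.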
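
The column-name style is not limited to lmplot: sns.scatterplot also accepts a data argument, so the earlier color-coded plot can be written the same way. The sketch below uses a made-up DataFrame (sample_data is a hypothetical name) and the non-interactive Agg backend so that it also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd
import seaborn as sns

# Made-up values for illustration only; not taken from insurance.csv.
sample_data = pd.DataFrame({
    "bmi":     [22.0, 26.0, 30.0, 34.0, 24.0, 28.0, 32.0, 36.0],
    "charges": [3000.0, 3500.0, 4100.0, 4400.0,
                15000.0, 21000.0, 30000.0, 38000.0],
    "smoker":  ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
})

# Column names plus data=..., instead of passing each Series explicitly.
ax = sns.scatterplot(x="bmi", y="charges", hue="smoker", data=sample_data)

# The axis labels are taken from the column names.
print(ax.get_xlabel(), ax.get_ylabel())
```

Either calling style produces the same figure; the column-name form is simply shorter when all variables live in one DataFrame.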

For a categorical scatter plot, use sns.swarmplot:

sns.swarmplot(x=insurance_data['smoker'],
              y=insurance_data['charges'])

/opt/conda/lib/python3.6/site-packages/seaborn/categorical.py:1296: UserWarning: 67.3% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)

<AxesSubplot:xlabel='smoker', ylabel='charges'>
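
The UserWarning above means that many points could not be placed without overlapping. It suggests two remedies: smaller markers or stripplot. The sketch below shows both side by side on made-up data (sample_data is hypothetical, and size=3 is an arbitrary choice); it is one possible way to act on the warning, not the original author's code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values for illustration only; not taken from insurance.csv.
sample_data = pd.DataFrame({
    "smoker":  ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    "charges": [3000.0, 3500.0, 4100.0, 4400.0,
                15000.0, 21000.0, 30000.0, 38000.0],
})

fig, (ax_swarm, ax_strip) = plt.subplots(1, 2, figsize=(10, 4))

# Option 1: keep swarmplot but shrink the markers (the default size is 5).
sns.swarmplot(x="smoker", y="charges", data=sample_data, size=3, ax=ax_swarm)

# Option 2: stripplot jitters points randomly instead of packing them,
# so it never runs out of room, at the cost of some overlap.
sns.stripplot(x="smoker", y="charges", data=sample_data, jitter=True, ax=ax_strip)

ax_swarm.set_title("swarmplot, size=3")
ax_strip.set_title("stripplot, jitter=True")
```

With a dataset the size of insurance.csv, a smaller marker may still leave some points unplaced, in which case stripplot is the more robust choice.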